Skip to content

Fix double count#2088

Open
cristiklein wants to merge 12 commits intolycheeverse:masterfrom
cristiklein:fix-double-count
Open

Fix double count#2088
cristiklein wants to merge 12 commits intolycheeverse:masterfrom
cristiklein:fix-double-count

Conversation

@cristiklein
Copy link
Contributor

@cristiklein cristiklein commented Mar 13, 2026

Fixes #2072

@mre Is this what we want? Notice that de-duplication happens over Request.uri so the span, element and attribute of URIs from different sources are lost.

Copy link
Member

@katrinafyi katrinafyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the DashSet lead to significant non-determinism? If a failing URL is in multiple input files, it would only be reported in one file. The user would fix that, re-run lychee, and then see a new failure for the same URL in a different file. This seems bad.

Another approach would be to change the counter tracking so a cached entry does not increment the other counters (at https://github.com/lycheeverse/lychee/blob/fe604ba9adffa0e59554c6d65641cb300127f0d0/lychee-bin/src/formatters/stats/response.rs )

@cristiklein
Copy link
Contributor Author

@katrinafyi By reading your comment, I realized that I went completely in the wrong direction. The ask was not to reintroduce the DashSet, but to count unique URLs. Thanks for pointing me in the right direction.

I now changed the PR to do exactly that. See 75d9f9f

So now we have:

$ cargo run -- fixtures/double-count/
[...]
🔍 6 Total (in 4ms) 🔗 3 Unique ✅ 6 OK 🚫 0 Errors

@afalhambra-hivemq Would you have time to try out and check if this matches your expectations?

@mre
Copy link
Member

mre commented Mar 14, 2026

Ooh, the number of unique urls is nice.

@katrinafyi
Copy link
Member

I wonder if it's possible to use the Status::Cached enum variant to determine whether to increment or not. It would change the metric from "unique" to "non-cached".

I just think it would be nice to avoid another hashset because we already have caches at several layers.

@cristiklein
Copy link
Contributor Author

I wonder if it's possible to use the Status::Cached enum variant to determine whether to increment or not. It would change the metric from "unique" to "non-cached".

I just think it would be nice to avoid another hashset because we already have caches at several layers.

I agree it would be nice to avoid another HashSet.

I tried displaying stats.cached in addition to stats.unique but the results look rather unexpected:

$ cargo run -- fixtures/double-count/
   Compiling lychee v0.23.0 (/home/cklein/elastisys/lychee/lychee-bin)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.30s
     Running `target/debug/lychee fixtures/double-count/`
6/6 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                               
🔍 6 Total (in 4ms) 🔗 3 Unique ♻️ 0 Cached ✅ 6 OK 🚫 0 Errors

$ cargo run -- https://elastisys.io/ https://elastisys.io/welkin/
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.27s
     Running `target/debug/lychee 'https://elastisys.io/' 'https://elastisys.io/welkin/'`
216/216 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                             [WARN] lychee detected 3 redirects. You might want to consider replacing redirecting URLs with the resolved URLs. Run lychee in verbose mode (-v/--verbose) to see details about the redirections.
🔍 216 Total (in 8s 582ms) 🔗 180 Unique ♻️ 24 Cached ✅ 213 OK 🚫 0 Errors 🔀 3 Redirects

Notice that 216 (Total) - 24 (Cached) == 192 Non-Cached != 180 Unique. I wonder if this indicates a bug in caching or if the number of non-cached requests and number of unique links are semantically different. 🤔

I see a few options moving forward:

  1. Only display number of unique links, to avoid confusion, using the HashSet.
  2. Display both non-cached and unique.
  3. Try to figure out why non-cached != unique.
  4. 2+3

I'm leaning towards 2 now and 3 later.

@katrinafyi
Copy link
Member

katrinafyi commented Mar 16, 2026

If the goal was to avoid the hash set, displaying both non-unique and cached counts still keeps the hash set around.

As a first guess, the different count could be due to multiple simultaneous requests to the same URL. Those get served by the second-layer cache (#2067) rather than the top-level cache, and the second later cache doesn't return Status::Cached for its cache hits.

@cristiklein
Copy link
Contributor Author

If the goal was to avoid the hash set, displaying both non-unique and cached counts still keeps the hash set around.

As a first guess, the different count could be due to multiple simultaneous requests to the same URL. Those get served by the second-layer cache (#2067) rather than the top-level cache, and the second later cache doesn't return Status::Cached for its cache hits.

Thanks for the explanation. I'm getting the feeling that "unique links" and "non-cached links" have different semantics. "unique links" measures the structure of the website and is "ground truth" which could also be measured by other tools. In contrast, "non-cached links" feels more like a metric measuring lychee's behavior.

As far as I understood, the ask in the original issue was about unique links, in which case I see the added HashSet as beneficial and necessary.

I propose we add unique links with HashSet for semantic consistency.

What do you think?

@katrinafyi
Copy link
Member

I don't have strong feelings. It could be said that total links (including duplicates) is also a true value independent of lychee. Non-cached links could be useful as a measure of how much work lychee actually did. Unique links is probably more intuitive to understand.

I'll leave it to you and others to decide, I don't have a stake here aside from not liking the hash set (but I can deal with it).

@afalhambra-hivemq
Copy link

@afalhambra-hivemq Would you have time to try out and check if this matches your expectations?

Just to clarify here. My ask was exactly that, to have a unique URL count. Otherwise, a total URL count jump from 4K to 113K, like we noticed, sounds very misleading.
Am also fine having a count regarding the non-cache URLs count as well. The more detail, the better. So we can configure and adjust Lychee setup accordingly based on this information.

If that's ok, I can give it a try and provide you with some feedback. Thanks for looking into this!

@afalhambra-hivemq
Copy link

Tested locally against our documentation site (~489 HTML pages, same site referenced in #2072).

Build: cargo install --git https://github.com/cristiklein/lychee.git --branch fix-double-count lychee

Command:

lychee --insecure -c ./links-check.toml './build/site/**/latest/**/*.html' \
  './build/site/hivemq-edge/**/*.html' \
  './build/site/hivemq-platform-operator/**/*.html' \
  './build/site/hivemq-operator/**/*.html' \
  './build/site/hivemq-cloud/**/*.html' \
  './static/llms.txt' './static/llms-full.txt'

Results:

| Status         | Count  |
|----------------|--------|
| 🔍 Total       | 106097 |
| 🔗 Unique      | 4190   |
| ✅ Successful  | 105443 |
| ⏳ Timeouts    | 0      |
| 🔀 Redirected  | 102    |
| 👻 Excluded    | 212    |
| ❓ Unknown     | 0      |
| 🚫 Errors      | 340    |

The new Unique row is exactly what we needed — the 4,190 unique URLs aligns with the ~3,800 we saw on v0.21.0 (difference is expected from content growth). The errors are expected local file references (JavaDoc/REST API specs not generated in this build).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Summary 'Total' count inflated after removal of cross-file deduplication

4 participants